Reliability and validity of the Roche PD Mobile Application for remote monitoring of early Parkinson’s disease

Digital health technologies enable remote and therefore frequent measurement of motor signs, potentially providing reliable and valid estimates of motor sign severity and progression in Parkinson’s disease (PD). The Roche PD Mobile Application v2 was developed to measure bradykinesia, bradyphrenia and speech, tremor, gait and balance. It comprises 10 smartphone active tests (with half of the tests administered each day), as well as daily passive monitoring via a smartphone and smartwatch. It was studied in 316 early-stage PD participants who performed daily active tests at home and then carried a smartphone and wore a smartwatch throughout the day for passive monitoring (study NCT03100149). Here, we report baseline data. Adherence was excellent (96.29%). All pre-specified sensor features exhibited good-to-excellent test–retest reliability (median intraclass correlation coefficient = 0.9), and correlated with corresponding Movement Disorder Society–Unified Parkinson's Disease Rating Scale items (rho: 0.12–0.71). These findings demonstrate the preliminary reliability and validity of remote at-home quantification of motor sign severity with the Roche PD Mobile Application v2 in individuals with early PD.

Sensor feature sensitivity to side differences. Sensor feature values from all lateralized tests demonstrated significant differences between the most and least affected sides (Supplementary Table 1). Moreover, sensor features and MDS-UPDRS scores measuring the same motor sign on the less affected (or more affected) side were more strongly correlated than sensor features and MDS-UPDRS scores measuring the same motor sign on opposite sides of the body (Fig. 6).

Discussion
Reliability and validity of the Roche PD Mobile Application v2. The Roche PD Mobile Application v1 was designed to measure the core motor signs of PD 5,18 , and was recently revised to v2 to primarily include two new active tests of bradykinesia (Hand Turning, Draw A Shape), as well as a test of psychomotor slowing (eSDMT) and a speech test. In addition, the original gait task was revised to a U-turn test, and a smartwatch was incorporated into the remote passive monitoring procedure. Preliminary test-retest reliability scores for the pre-specified sensor features from all active tests except Speech and eSDMT, and for both passive monitoring measures, were in the 'excellent' range 22 21 . Despite methodological differences, studies of these DHT tasks generally showed good correspondence between finger tapping sensor features and respective clinical ratings, as well as the ability to differentiate healthy controls from individuals with early PD, and individuals with early PD from individuals with later-stage PD 5,18,[25][26][27][28] , in line with the present findings. While the literature on digitized pronation/supination assessments is less rich than for finger tapping, available results also consistently demonstrate correlations with related clinical scores and the ability to differentiate healthy participants from individuals with PD 16,23,[29][30][31] . Spiral drawing is traditionally used in behavioral neurology to assess fine motor impairment including bradykinesia and tremor [32][33][34][35] . DHT versions of spiral drawing demonstrated that time to completion correlated with clinician ratings of bradykinesia severity, and differentiated PD cases from controls 34 . The majority of previous DHT spiral drawing tasks used pens/digital pens to draw on regular paper or tablets, a more challenging motor task compared with the present finger drawing on smaller smartphone touch screens.
In the present study, celerity, i.e. accuracy/time to complete spiral shape tracing on the smartphone screen, was pre-specified to additionally consider the accuracy of directed fine motor movements in the unsupervised at-home setting. Spiral celerity correlated with MDS-UPDRS bradykinesia measures, and the strength of these correlations was numerically smaller compared with Finger Tapping and Hand Turning. This may be due to the relative difficulty of the latter two tasks compared with spiral drawing, which may have challenged individuals more, thereby revealing greater impairment. We note that additional sensor features (e.g. variability in drawing speed, hesitation), analyzed either individually or combined within and across shapes, are expected to provide additional meaningful information, as has been shown for PD and multiple sclerosis 36,37 .
Passive monitoring with smartwatches. Passive monitoring with smartwatches provides a unique opportunity to explore slowing of upper limb movements during daily life. Here, sensor data segments during arm movements were identified from the circa 90% non-walking periods in the passive monitoring sensor data stream, using the squared magnitude of the accelerometer sensor movement as the sensor feature. This same feature has been related to decreased expressivity in patients with schizophrenia with negative symptoms 38 . Here, arm movement power was specifically related to the MDS-UPDRS bradykinesia subscore and item scores, as well as the rigidity subscore, and is in line with a slowing of hand movement in daily non-gait-related activities such as gesturing when speaking, eating, etc. These findings are consistent with previous research with wrist-worn wearables, which traditionally focused on arm swing during gait [39][40][41] , as well as multi-sensor systems used to measure the impact of bradykinesia on activities of daily living 15,42 . Thus, passively monitored motor behavior in daily life may facilitate our understanding of the effect and burden of PD on individuals' daily lives.
DHT measurement of bradyphrenia. The eSDMT 43 is commonly applied to measure psychomotor slowing, or bradyphrenia, one of the earliest cognitive signs in PD, appearing up to 5 years prior to a PD dementia diagnosis 20 . However, as the test requires multiple cognitive functions, it is not surprising that it is sensitive to many forms of neurologic impairment 44 . Indeed, while SDMT performance is reduced in PD 45 , impairments are exacerbated in individuals with PD with concomitant vascular 46 and amyloid 47 imaging findings. A standard SDMT outcome measure, number of correct responses in 90 s, was pre-specified for the present analyses of the eSDMT, and showed 'good' 22 test-retest reliability (ICC = 0.75). However, it correlated only weakly (rho = −0.18) with the MDS-UPDRS item 1.1. assessing global cognitive impairment. This finding is surprising given the catch-all nature of both the eSDMT and MDS-UPDRS item 1.1., but may be accounted for by the fact that cognitive impairments were excluded during the screening process in the PASADENA study, leading to a truncation of range in both scores (see Supplementary Fig. 1). We note that we attempted to minimize the effect of bradykinesia on eSDMT scores by requiring a simple tap response on a number pad displayed at the bottom half of the smartphone screen. Nevertheless, to mitigate the risk of this confound, eSDMT performance could be controlled by a non-cognitively demanding motor test using a similar response format.
DHT measurement of voice and speech. Voice and speech impairments in PD are varied and generally summarized under the term dysarthria, and include resonatory, articulatory, phonatory, prosodic and respiratory components 48 . This symptomatology and its relevance to patients' daily lives motivated the inclusion of a Sustained Phonation task in the suite of active tests, and the development of the novel Speech test. Voice jitter was pre-selected as a proxy of disordered vocal fold function for the sustained phonation test. In line with previ-  The bulbar MDS-UPDRS Part III composite item score was designed to gauge the severity of motor impairments in body parts involved in speech production. Despite a truncation of range in this score (average < 3/20 points), MFCC2 variability correlated with the bulbar score, indicating that this feature may estimate the severity of motor impairments in the speech apparatus. Future research will investigate further richly multi-faceted aspects of speech function to better understand motor and cognitive behavior in PD.
DHT measurement of tremor, turning and balance. The Roche PD Mobile Application v2 aims to assess the broad array of motor signs in PD and related movement disorders. Thus, besides bradykinesia, speech, voice, and psychomotor slowing, tremor (rest, postural), turning during gait, and balance were also assessed. The rest and postural tremor active test features corresponded most strongly to the respective MDS-UPDRS concepts of tremor, as demonstrated by the highest correlation overall with any MDS-UPDRS item and subscale scores. This is consistent with similar DHT reports 5,25,50 . The novel U-turn test (which instructed individuals, if safe to do so, to walk several paces and make a U-turn at least five times) and the identification of turning while walking throughout the day in passive monitoring sensor data, were motivated by findings that turning is particularly impaired in PD 5,51,52 . For example, a 360 degree walking turn and instrumented timed-up-and-go test showed strong reliability and discriminated controls from PD participants 53,54 . Similarly, sensor-based measures of turn speed in daily life differentiated PD individuals from controls 55 . In the present study, turn speed measured in both the active test and passive setting correlated with MDS-UPDRS 3.14. body bradykinesia item scores, but was not specifically related to MDS-UPDRS PIGD relative to other subscores. While neither measure of turn speed differentiated between less and more affected individuals on MDS-UPDRS body bradykinesia scores of 0 versus 1, both differentiated between individuals in Hoehn and Yahr Stage I versus II. 
Although participants were not instructed to 'turn as fast as possible' to ensure a safe conduct of the active test, the U-turn test showed numerically higher correlations with body bradykinesia compared with passive turning speed, in line with a similar profile of performance (active testing) versus capacity (passive monitoring) scores previously demonstrated for gait speed 56 . In the balance active test, the jerk sensor feature correlated with the MDS-UPDRS 3.12. postural stability item score, similar to previous reports 5,57 , and differentiated individuals with MDS-UPDRS item 3.12 scores of 0 versus 1, but failed to differentiate individuals in Hoehn and Yahr Stage I versus II. We speculate that this negative finding may reflect the low levels of gait and postural instability impairments in the present cohort (mean PIGD = 1).

DHT composite scores. A composite summary score of individual features across diverse assessments is expected to provide a more robust measure of global PD severity and progression, especially given the heterogeneous nature of PD. Several DHT solutions besides the Roche PD Mobile Application v2 administer different motor active tests, and some additionally collect passive monitoring data 5,17,18 . Supplementary Table 2 provides a high-level comparison of these DHT solutions. All solutions contain active tests for tremor and tapping, but vary with respect to the inclusion of other upper limb, postural stability/gait, cognition, and voice/speech tests, and whether passively monitored motor data are collected. The power of combining different features across the tests in these DHTs has been shown via machine learning models that predict MDS-UPDRS total scores (Roche PD Mobile Application v1) 58 , lead to a new score based on differentiation of ON and OFF L-dopa states 59 , or distinguish between healthy controls, individuals with idiopathic Rapid Eye Movement sleep behavior disorder and individuals with PD 16,60 . A machine learning approach was also used to combine different HopkinsPD baseline sensor features to predict clinically significant events (e.g. falls, functional impairment) at the 18-month follow-up 61 . In contrast to data-driven approaches to composite score development, a clinical outcomes assessment approach could be applied whereby information from individuals with PD informs the selection of sensor features such that they optimally reflect what matters most to patients 62 .
Limitations. Several facets of the present study limit the generalizability of the findings. First, all individuals' disease duration was < 2 years, and individuals were in Hoehn and Yahr Stages I or II. Thus, the applicability of the present findings to later-stage or prodromal PD is unknown. The reduced range of disease severities also appeared to limit the ranges of some DHT and clinical measures, which consequently limited the possibility to detect relationships between the two (Supplementary Fig. 1). Also, further research is necessary to better understand the suitability of this remote monitoring approach for later-stage patients with more severe cognitive or visual impairments. Second, since Roche PD Mobile Application v2 data are not yet available from neurologically normal individuals, sensor feature cut-off values differentiating normal from impaired motor behavior could not yet be calculated. It should also be noted that comparisons between DHT measures and clinical measures such as the MDS-UPDRS can be affected by limitations in the clinical measures; if an active test is not adequately reflected by a clinical measure, the ability to detect meaningful correlations is reduced. Finally, only two continuous 2-week periods of DHT data were analyzed; thus, the long-term adherence to the remote monitoring procedure and ability of sensor features to detect changes over time remain to be established. To this end, it is critical to quantify and report test-retest reliabilities of sensor feature scores in order to assess a sensor feature's potential to detect changes over time 63 and any deviation from normal progression as a function of e.g. pharmacological interventions.
The Roche PD Mobile Application v2 was designed to measure the severity of early PD core motor signs and to provide information complementary to established clinical outcome measures. This remote monitoring approach enables high-frequency (i.e. daily) assessments with low average daily burden. The frequent measurement coupled with the high sensitivity of smartphone/smartwatch sensors may increase signal-to-noise of digital outcome measures for clinical research and provide novel insights into patients' functioning in daily life.

Methods
Participants. Baseline Roche PD Mobile Application v2 data from 316 dopaminergic-treatment-naïve individuals recently diagnosed with dopamine transporter imaging with single-photon emission computed tomography-confirmed PD (Hoehn and Yahr Stages I-II, diagnosis ≤ 2 years) were analyzed (see Table 2 for demographic and clinical characteristics, and Supplementary Table 6 for non-parametric descriptive statistics). All individuals were enrolled in an ongoing randomized, double-blind, placebo-controlled, Phase II clinical trial (PASADENA Part 1; NCT03100149) of prasinezumab (RO7046015/PRX002), an anti-α-synuclein monoclonal antibody (see Pagano et al. 2021) 64 .
All of the 59 PASADENA sites received approval from their institutional review boards or ethics committees and collected data used for the present analyses, and written informed consent was provided by all participants. The study is being conducted in accordance with the Declaration of Helsinki and the International Conference on Harmonization Guidelines for Good Clinical Practice.
Roche PD Mobile Application v2. The Roche PD Mobile Application v2 consists of dedicated applications installed on a provisioned smartphone and smartwatch (see Fig. 1). The PD Mobile Application prompted participants to perform the active tests described below. All unilateral tests were performed twice, once with each side of the body.
5. Phonation: participants were instructed to make a single, continuous "aaaah" sound for as long as possible with one breath and in a steady pitch and volume while the phone was held at the ear (timeout: 30 s);
6. Postural tremor: participants were instructed to sit with their eyes closed, and to hold the smartphone in an outstretched hand while counting down out loud from a pre-specified number that differed for each test administration (15 s per hand);
7. Rest tremor: participants were instructed to sit with their eyes closed and to hold the phone in the palm of their hand, with their forearm resting on their thigh, and to count down out loud from a pre-specified number that differed for each test administration (15 s per hand);
8. Balance: while standing with the smartphone in a running belt at waist height with the phone placed at the front of the body, participants were instructed to stand still with their arms at their side (30 s);
9. U-turn: participants were instructed to place the smartphone in a running belt with the phone placed at the front of the body, and to walk between two points at least four steps apart at normal speed, completing at least five turns in 60 s;
10. eSDMT 43 : participants were instructed to match a sequence of displayed symbols to the respective numbers using a displayed coding key as quickly and as accurately as possible (90 s).
The Roche PD Mobile Application v2 additionally administered questionnaires, which are not the focus of the present report. For passive monitoring, participants were instructed to carry their smartphone (e.g. in their trouser pocket or in the pouch of a provided running belt) and wear their smartwatch as they conducted the daily active tests and their normal daily activities. No active tests were administered directly via the smartwatch.
Procedure. Participants were provided with an Android smartphone (Galaxy S7, Samsung, Seoul, South Korea) and smartwatch (Moto G 360 2nd Gen Sport; Motorola, Chicago, USA) during a screening visit at the latest 7 days prior to the baseline clinical visit, and trained on the use of the devices and the Roche PD Mobile Application v2. Participants were instructed to open the application on the provisioned smartphone every morning. Active tests were scheduled automatically such that half of the motor tests were presented on alternating days, and the eSDMT every 2 weeks (Fig. 1), with a total expected testing time (per day) including transitions and test-start countdowns between tests of 5-10 min (including eSDMT). All data were stored in encrypted files on the smartphone and sent by WiFi to a cloud storage facility each time the smartphone connected to the Internet.
Baseline clinical assessments included the MDS-UPDRS 8 , from which subscale scores were generated (i.e. PIGD, bradykinesia, rigidity, tremor) 28 . The MDS-UPDRS was administered according to standardized procedures, and all MDS-UPDRS raters completed online MDS-UPDRS training by the MDS. Additionally, a 'bulbar score' was defined as the composite sum of MDS-UPDRS items 2.1 Speech; 2.2 Saliva and Drooling; 2.3 Chewing and Swallowing; 3.1 Speech; and 3.2 Facial expression.
Sensor data processing. The raw sensor data from the smartphone and smartwatch were extracted and processed using a dedicated internally developed backend infrastructure. Custom algorithms implemented in Python were applied on quality-controlled sensor data (e.g. for correct test execution) and converted data into pre-defined 'sensor features', one per active test performed and side of body (if applicable) and one for passive monitoring. Features were selected based on previous literature and their relevance to PD (Supplementary Table 3).
1. Draw a shape: The feature Spiral celerity combines drawing accuracy and drawing speed (accuracy/speed) of the spiral drawing.
2. Dexterity: Tapping variability quantifies the variability of tapping speed as measured by the standard deviation of the time between consecutive tap events. This feature has already shown positive results in a previous study in PD 5 .
3. Hand turning: Median hand turning speed is the median turn speed over all segmented hand rotations to estimate bradykinesia.
4. Speech: The MFCC2 is the ratio between vocal tract resonation of the high and vocal fold vibration of the low Mel-frequency bands affected in PD. MFCC values have already shown case/control differences in other studies of PD 5 . For this novel feature, MFCC2s of consecutively voiced speech segments (i.e. parts of speech that are longer than 200 ms and are acoustically distinguishable) were calculated and averaged over each segment. MFCC2s are interpreted as a measure of speech monotonicity.
5. Phonation: Voice jitter is defined as a mean of the absolute differences between the period of adjacent pitch cycles, normalized by the mean pitch period, multiplied by 100. This definition of jitter is also referred to as jitter:local 49 . Jitter represents the variability of the speech fundamental frequency (pitch period) from one cycle to another and is a measure of micro-instability of vocal fold vibration, where higher values of the feature indicate higher instability of vocal fold vibration.
6. Rest and postural tremor: Log median squared energy measures the average acceleration magnitude, which is a proxy for the average amplitude during tremor induced by hand movements when trying to hold the hand still. A similar feature showed clinical validity for PD in a previous study 5 .
7. Balance: Log sway jerk describes the jerkiness (i.e. irregular, non-smooth accelerations) of movements when trying to stand still and may be a marker of disease progression in PD 65 .
8. U-turn: Median turn speed describes the average turn speed of all turns completed during the U-turn test. The same feature correlated with clinically assessed gait impairment in previous studies in individuals with PD and multiple sclerosis 5 .
9. SDMT: Number of correct responses is the standard feature also reported in the traditional paper-based in-clinic SDMT 43 .
10. Passive monitoring-gait: Median turn speed in passive monitoring is the average turn speed of all turns detected over a given day of smartphone sensor recording, and had previously demonstrated discriminability between individuals with PD and control participants 66 .
11. Passive monitoring-non-gait arm movements 67 : Median arm movement power (non-gait) is the median of the integrated squared acceleration magnitude (i.e. power) over all identified arm movements during non-gait data segments in a given day of sensor recording with the smartwatch. As such, it measures the intensity of arm movements during activities of daily living (gesturing when speaking, grabbing something, etc.) that do not occur during periods of walking (i.e. does not reflect arm swing while walking). Here, we hypothesize that a reduced intensity of arm movements is associated with bradykinesia.
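Two of the feature definitions above (tapping variability in item 2 and voice jitter in item 5) are simple enough to illustrate directly. The following is a minimal sketch and not the study's implementation: the function names are hypothetical, tap detection and pitch-period extraction are assumed to have been done upstream, and the sample (n − 1) standard deviation is an assumption because the text does not state which estimator was used.

```python
from statistics import mean, stdev


def tapping_variability(tap_times_s):
    """Standard deviation of the time between consecutive taps (seconds).

    Assumption: sample standard deviation (n - 1 denominator).
    """
    intervals = [b - a for a, b in zip(tap_times_s, tap_times_s[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three taps")
    return stdev(intervals)


def jitter_local(pitch_periods_s):
    """Jitter (local), in percent.

    Mean absolute difference between adjacent pitch periods, divided by
    the mean pitch period, multiplied by 100.
    """
    if len(pitch_periods_s) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(b - a) for a, b in zip(pitch_periods_s, pitch_periods_s[1:])]
    return 100.0 * mean(diffs) / mean(pitch_periods_s)
```

Perfectly regular tapping or a perfectly periodic voice signal yields a value of 0 for the respective feature; larger values indicate greater irregularity.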
All features reported in this manuscript are based on smartphone sensor data, with the exception of 'arm movement power (non-gait)', which leverages passively acquired sensor data collected with the smartwatch.
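The arm movement power feature (the median, over a day, of the integrated squared acceleration magnitude of each identified non-gait arm movement) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the segmentation into arm movements and the exclusion of gait periods are assumed to have been done already, the acceleration samples are assumed to be gravity-corrected, and a simple rectangle rule is used for the integration.

```python
from statistics import median


def movement_power(acc_xyz, dt_s):
    """Integrated squared acceleration magnitude of one movement segment.

    acc_xyz: list of (x, y, z) accelerometer samples.
    dt_s: sampling interval in seconds (rectangle-rule integration).
    """
    return sum(x * x + y * y + z * z for x, y, z in acc_xyz) * dt_s


def median_arm_movement_power(segments, dt_s):
    """Median power over all non-gait arm-movement segments of one day."""
    if not segments:
        raise ValueError("no arm-movement segments")
    return median(movement_power(seg, dt_s) for seg in segments)
```

Using the median rather than the mean keeps a few unusually vigorous movements in a day from dominating the daily value.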
Data underwent quality control (QC) checks to ensure that the tests had been performed properly. Towards this end, QC metrics were generated. For example, one QC metric quantified the amount of energy from the accelerometer during the Hand Turning test to estimate whether the smartphone was lying still (e.g. on a table) or moving during the test. 0.3% (n = 179/56,786) of digital active test data did not meet the pre-specified QC thresholds and were therefore excluded from the analyses.
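A QC metric of the kind described above for the Hand Turning test might look like the following sketch. The study's actual metric and threshold are not given in the text, so both the choice of metric (variance of the acceleration magnitude as a proxy for movement energy) and the threshold value `min_energy` are assumptions made purely for illustration.

```python
from statistics import pvariance


def phone_was_moving(acc_xyz, min_energy=0.05):
    """Return True if the accelerometer signal suggests the phone moved.

    acc_xyz: list of (x, y, z) accelerometer samples.
    min_energy: hypothetical threshold, not the study's value.
    """
    magnitudes = [(x * x + y * y + z * z) ** 0.5 for x, y, z in acc_xyz]
    # A phone lying still records a near-constant magnitude (gravity only),
    # so the variance of the magnitude is close to zero.
    return pvariance(magnitudes) >= min_energy
```

A test recording for which such a check returns False would be flagged as not performed properly and excluded from analysis.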

Statistical analyses. Sensor features from passive monitoring and each active test performed were summarized (median) over 2-week intervals starting at the baseline visit (Weeks 1 and 2) and in the 2-week period thereafter, provided that ≥ 3 data points were available during each 2-week testing interval. Where applicable, sensor features were assigned to less/more affected side (for definition see Supplementary Material). For convergent/divergent validity (i.e. degree of association with related/unrelated symptom domains), the averaged (median) sensor data collected during the first two study weeks were compared with clinical data collected at the baseline visit (Day 1) using Spearman's correlations. Adherence and test-retest metrics were calculated for aggregated sensor features for the first two 2-week study periods. Adherence was defined as the number of fully completed active testing sessions relative to the number of all possible active testing sessions. For passive monitoring, also calculated over the first two 2-week study periods, the average number of hours per day participants carried the provisioned study smartphone with them and wore the study smartwatch was calculated. Sensor feature test-retest reliabilities were quantified with the ICC between averaged values of the first and second contiguous 2-week periods. To investigate the sensitivity of sensor features to subtle or very early symptoms, sensor features from participants receiving MDS-UPDRS item scores of 0 versus 1 were compared using Mann-Whitney U tests. For known-groups validity (i.e. differences between pre-defined groups where a difference is prima facie expected), sensor features were compared between participants in Hoehn and Yahr Stage I versus II, and by comparing sensor feature values from less and more affected sides, both using Mann-Whitney U tests.
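The aggregation, correlation and reliability steps described above can be sketched in plain Python. This is an illustrative sketch, not the analysis code used in the study: the function names are hypothetical, and the ICC form shown, ICC(2,1) (two-way random effects, absolute agreement, single measurement), is an assumption because the text does not state which form was computed.

```python
from statistics import mean, median


def two_week_median(values, min_points=3):
    """Median of one 2-week interval, or None if fewer than min_points values."""
    return median(values) if len(values) >= min_points else None


def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def icc_2_1(first, second):
    """ICC(2,1) between per-participant values of two periods (lists)."""
    n, k = len(first), 2
    rows = list(zip(first, second))
    grand = mean(first + second)
    row_means = [mean(r) for r in rows]
    col_means = [mean(first), mean(second)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

In this sketch, each participant contributes one aggregated value per 2-week period; `spearman_rho` would relate the first-period values to a clinical score, and `icc_2_1` would compare first-period with second-period values. Because ICC(2,1) measures absolute agreement, a constant shift between the two periods lowers the coefficient even when the rank order of participants is unchanged.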